Intro

In this competition, we need to predict the price of houses in Ames, Iowa. We have a dataset of 80 features we have to analyze to understand which features will help to predict and which are just 'white noise'.
In this assignment, I will use K-Fold Cross-Validation to validate each step in the tuning of the hyper-parameters and to select the best set of features for the final model, and as well, forward feature selection method to choose the features.

Package Updates

In [ ]:
!pip install --upgrade plotly
Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.14.1)
Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python3.6/dist-packages (from plotly) (1.15.0)
Requirement already satisfied, skipping upgrade: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly) (1.3.3)

Import

In [ ]:
import tensorflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from tqdm.auto import tqdm

from scipy import stats
from scipy.stats import norm


# sklearn imports
from sklearn import metrics
from sklearn import pipeline
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFE
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedKFold
In [ ]:
train_df = pd.DataFrame(pd.read_csv('train.csv'))
test_df = pd.DataFrame(pd.read_csv('test.csv'))
In [ ]:
display(train_df)
display(test_df)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating ... CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA ... Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA ... Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA ... Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA ... Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA ... Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 6 5 1999 2000 Gable CompShg VinylSd VinylSd None 0.0 TA TA PConc Gd TA No Unf 0 Unf 0 953 953 GasA ... Y SBrkr 953 694 0 1647 0 0 2 1 3 1 TA 7 Typ 1 TA Attchd 1999.0 RFn 2 460 TA TA Y 0 40 0 0 0 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub Inside Gtl NWAmes Norm Norm 1Fam 1Story 6 6 1978 1988 Gable CompShg Plywood Plywood Stone 119.0 TA TA CBlock Gd TA No ALQ 790 Rec 163 589 1542 GasA ... Y SBrkr 2073 0 0 2073 1 0 2 0 3 1 TA 7 Min1 2 TA Attchd 1978.0 Unf 2 500 TA TA Y 349 0 0 0 0 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub Inside Gtl Crawfor Norm Norm 1Fam 2Story 7 9 1941 2006 Gable CompShg CemntBd CmentBd None 0.0 Ex Gd Stone TA Gd No GLQ 275 Unf 0 877 1152 GasA ... Y SBrkr 1188 1152 0 2340 0 0 2 0 4 1 Gd 9 Typ 2 Gd Attchd 1941.0 RFn 1 252 TA TA Y 0 60 0 0 0 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub Inside Gtl NAmes Norm Norm 1Fam 1Story 5 6 1950 1996 Hip CompShg MetalSd MetalSd None 0.0 TA TA CBlock TA TA Mn GLQ 49 Rec 1029 0 1078 GasA ... Y FuseA 1078 0 0 1078 1 0 1 0 2 1 Gd 5 Typ 0 NaN Attchd 1950.0 Unf 1 240 TA TA Y 366 0 112 0 0 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub Inside Gtl Edwards Norm Norm 1Fam 1Story 5 6 1965 1965 Gable CompShg HdBoard HdBoard None 0.0 Gd TA CBlock TA TA No BLQ 830 LwQ 290 136 1256 GasA ... Y SBrkr 1256 0 0 1256 1 0 1 1 3 1 TA 6 Typ 0 NaN Attchd 1965.0 Fin 1 276 TA TA Y 736 68 0 0 0 0 NaN NaN NaN 0 6 2008 WD Normal 147500

1460 rows × 81 columns

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub Inside Gtl NAmes Feedr Norm 1Fam 1Story 5 6 1961 1961 Gable CompShg VinylSd VinylSd None 0.0 TA TA CBlock TA TA No Rec 468.0 LwQ 144.0 270.0 882.0 GasA TA Y SBrkr 896 0 0 896 0.0 0.0 1 0 2 1 TA 5 Typ 0 NaN Attchd 1961.0 Unf 1.0 730.0 TA TA Y 140 0 0 0 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 6 1958 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace 108.0 TA TA CBlock TA TA No ALQ 923.0 Unf 0.0 406.0 1329.0 GasA TA Y SBrkr 1329 0 0 1329 0.0 0.0 1 1 3 1 Gd 6 Typ 0 NaN Attchd 1958.0 Unf 1.0 312.0 TA TA Y 393 36 0 0 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 5 5 1997 1998 Gable CompShg VinylSd VinylSd None 0.0 TA TA PConc Gd TA No GLQ 791.0 Unf 0.0 137.0 928.0 GasA Gd Y SBrkr 928 701 0 1629 0.0 0.0 2 1 3 1 TA 6 Typ 1 TA Attchd 1997.0 Fin 2.0 482.0 TA TA Y 212 34 0 0 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 6 6 1998 1998 Gable CompShg VinylSd VinylSd BrkFace 20.0 TA TA PConc TA TA No GLQ 602.0 Unf 0.0 324.0 926.0 GasA Ex Y SBrkr 926 678 0 1604 0.0 0.0 2 1 3 1 Gd 7 Typ 1 Gd Attchd 1998.0 Fin 2.0 470.0 TA TA Y 360 36 0 0 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub Inside Gtl StoneBr Norm Norm TwnhsE 1Story 8 5 1992 1992 Gable CompShg HdBoard HdBoard None 0.0 Gd TA PConc Gd TA No ALQ 263.0 Unf 0.0 1017.0 1280.0 GasA Ex Y SBrkr 1280 0 0 1280 0.0 0.0 2 0 2 1 Gd 5 Typ 0 NaN Attchd 1992.0 RFn 2.0 506.0 TA TA Y 0 82 0 0 144 0 NaN NaN NaN 0 1 2010 WD Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 2915 160 RM 21.0 1936 Pave NaN Reg Lvl AllPub Inside Gtl MeadowV Norm Norm Twnhs 2Story 4 7 1970 1970 Gable CompShg CemntBd CmentBd None 0.0 TA TA CBlock TA TA No Unf 0.0 Unf 0.0 546.0 546.0 GasA Gd Y SBrkr 546 546 0 1092 0.0 0.0 1 1 3 1 TA 5 Typ 0 NaN NaN NaN NaN 0.0 0.0 NaN NaN Y 0 0 0 0 0 0 NaN NaN NaN 0 6 2006 WD Normal
1455 2916 160 RM 21.0 1894 Pave NaN Reg Lvl AllPub Inside Gtl MeadowV Norm Norm TwnhsE 2Story 4 5 1970 1970 Gable CompShg CemntBd CmentBd None 0.0 TA TA CBlock TA TA No Rec 252.0 Unf 0.0 294.0 546.0 GasA TA Y SBrkr 546 546 0 1092 0.0 0.0 1 1 3 1 TA 6 Typ 0 NaN CarPort 1970.0 Unf 1.0 286.0 TA TA Y 0 24 0 0 0 0 NaN NaN NaN 0 4 2006 WD Abnorml
1456 2917 20 RL 160.0 20000 Pave NaN Reg Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1Story 5 7 1960 1996 Gable CompShg VinylSd VinylSd None 0.0 TA TA CBlock TA TA No ALQ 1224.0 Unf 0.0 0.0 1224.0 GasA Ex Y SBrkr 1224 0 0 1224 1.0 0.0 1 0 4 1 TA 7 Typ 1 TA Detchd 1960.0 Unf 2.0 576.0 TA TA Y 474 0 0 0 0 0 NaN NaN NaN 0 9 2006 WD Abnorml
1457 2918 85 RL 62.0 10441 Pave NaN Reg Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam SFoyer 5 5 1992 1992 Gable CompShg HdBoard Wd Shng None 0.0 TA TA PConc Gd TA Av GLQ 337.0 Unf 0.0 575.0 912.0 GasA TA Y SBrkr 970 0 0 970 0.0 1.0 1 0 3 1 TA 6 Typ 0 NaN NaN NaN NaN 0.0 0.0 NaN NaN Y 80 32 0 0 0 0 NaN MnPrv Shed 700 7 2006 WD Normal
1458 2919 60 RL 74.0 9627 Pave NaN Reg Lvl AllPub Inside Mod Mitchel Norm Norm 1Fam 2Story 7 5 1993 1994 Gable CompShg HdBoard HdBoard BrkFace 94.0 TA TA PConc Gd TA Av LwQ 758.0 Unf 0.0 238.0 996.0 GasA Ex Y SBrkr 996 1004 0 2000 0.0 0.0 2 1 3 1 TA 9 Typ 1 TA Attchd 1993.0 Fin 3.0 650.0 TA TA Y 190 48 0 0 0 0 NaN NaN NaN 0 11 2006 WD Normal

1459 rows × 80 columns

Let's see how the data looks.

In [ ]:
display(train_df.describe(include="O").T)
display(train_df.describe().T)
count unique top freq
MSZoning 1460 5 RL 1151
Street 1460 2 Pave 1454
Alley 91 2 Grvl 50
LotShape 1460 4 Reg 925
LandContour 1460 4 Lvl 1311
Utilities 1460 2 AllPub 1459
LotConfig 1460 5 Inside 1052
LandSlope 1460 3 Gtl 1382
Neighborhood 1460 25 NAmes 225
Condition1 1460 9 Norm 1260
Condition2 1460 8 Norm 1445
BldgType 1460 5 1Fam 1220
HouseStyle 1460 8 1Story 726
RoofStyle 1460 6 Gable 1141
RoofMatl 1460 8 CompShg 1434
Exterior1st 1460 15 VinylSd 515
Exterior2nd 1460 16 VinylSd 504
MasVnrType 1452 4 None 864
ExterQual 1460 4 TA 906
ExterCond 1460 5 TA 1282
Foundation 1460 6 PConc 647
BsmtQual 1423 4 TA 649
BsmtCond 1423 4 TA 1311
BsmtExposure 1422 4 No 953
BsmtFinType1 1423 6 Unf 430
BsmtFinType2 1422 6 Unf 1256
Heating 1460 6 GasA 1428
HeatingQC 1460 5 Ex 741
CentralAir 1460 2 Y 1365
Electrical 1459 5 SBrkr 1334
KitchenQual 1460 4 TA 735
Functional 1460 7 Typ 1360
FireplaceQu 770 5 Gd 380
GarageType 1379 6 Attchd 870
GarageFinish 1379 3 Unf 605
GarageQual 1379 5 TA 1311
GarageCond 1379 5 TA 1326
PavedDrive 1460 3 Y 1340
PoolQC 7 3 Gd 3
Fence 281 4 MnPrv 157
MiscFeature 54 4 Shed 49
SaleType 1460 9 WD 1267
SaleCondition 1460 6 Normal 1198
count mean std min 25% 50% 75% max
Id 1460.0 730.500000 421.610009 1.0 365.75 730.5 1095.25 1460.0
MSSubClass 1460.0 56.897260 42.300571 20.0 20.00 50.0 70.00 190.0
LotFrontage 1201.0 70.049958 24.284752 21.0 59.00 69.0 80.00 313.0
LotArea 1460.0 10516.828082 9981.264932 1300.0 7553.50 9478.5 11601.50 215245.0
OverallQual 1460.0 6.099315 1.382997 1.0 5.00 6.0 7.00 10.0
OverallCond 1460.0 5.575342 1.112799 1.0 5.00 5.0 6.00 9.0
YearBuilt 1460.0 1971.267808 30.202904 1872.0 1954.00 1973.0 2000.00 2010.0
YearRemodAdd 1460.0 1984.865753 20.645407 1950.0 1967.00 1994.0 2004.00 2010.0
MasVnrArea 1452.0 103.685262 181.066207 0.0 0.00 0.0 166.00 1600.0
BsmtFinSF1 1460.0 443.639726 456.098091 0.0 0.00 383.5 712.25 5644.0
BsmtFinSF2 1460.0 46.549315 161.319273 0.0 0.00 0.0 0.00 1474.0
BsmtUnfSF 1460.0 567.240411 441.866955 0.0 223.00 477.5 808.00 2336.0
TotalBsmtSF 1460.0 1057.429452 438.705324 0.0 795.75 991.5 1298.25 6110.0
1stFlrSF 1460.0 1162.626712 386.587738 334.0 882.00 1087.0 1391.25 4692.0
2ndFlrSF 1460.0 346.992466 436.528436 0.0 0.00 0.0 728.00 2065.0
LowQualFinSF 1460.0 5.844521 48.623081 0.0 0.00 0.0 0.00 572.0
GrLivArea 1460.0 1515.463699 525.480383 334.0 1129.50 1464.0 1776.75 5642.0
BsmtFullBath 1460.0 0.425342 0.518911 0.0 0.00 0.0 1.00 3.0
BsmtHalfBath 1460.0 0.057534 0.238753 0.0 0.00 0.0 0.00 2.0
FullBath 1460.0 1.565068 0.550916 0.0 1.00 2.0 2.00 3.0
HalfBath 1460.0 0.382877 0.502885 0.0 0.00 0.0 1.00 2.0
BedroomAbvGr 1460.0 2.866438 0.815778 0.0 2.00 3.0 3.00 8.0
KitchenAbvGr 1460.0 1.046575 0.220338 0.0 1.00 1.0 1.00 3.0
TotRmsAbvGrd 1460.0 6.517808 1.625393 2.0 5.00 6.0 7.00 14.0
Fireplaces 1460.0 0.613014 0.644666 0.0 0.00 1.0 1.00 3.0
GarageYrBlt 1379.0 1978.506164 24.689725 1900.0 1961.00 1980.0 2002.00 2010.0
GarageCars 1460.0 1.767123 0.747315 0.0 1.00 2.0 2.00 4.0
GarageArea 1460.0 472.980137 213.804841 0.0 334.50 480.0 576.00 1418.0
WoodDeckSF 1460.0 94.244521 125.338794 0.0 0.00 0.0 168.00 857.0
OpenPorchSF 1460.0 46.660274 66.256028 0.0 0.00 25.0 68.00 547.0
EnclosedPorch 1460.0 21.954110 61.119149 0.0 0.00 0.0 0.00 552.0
3SsnPorch 1460.0 3.409589 29.317331 0.0 0.00 0.0 0.00 508.0
ScreenPorch 1460.0 15.060959 55.757415 0.0 0.00 0.0 0.00 480.0
PoolArea 1460.0 2.758904 40.177307 0.0 0.00 0.0 0.00 738.0
MiscVal 1460.0 43.489041 496.123024 0.0 0.00 0.0 0.00 15500.0
MoSold 1460.0 6.321918 2.703626 1.0 5.00 6.0 8.00 12.0
YrSold 1460.0 2007.815753 1.328095 2006.0 2007.00 2008.0 2009.00 2010.0
SalePrice 1460.0 180921.195890 79442.502883 34900.0 129975.00 163000.0 214000.00 755000.0

Now, let's see the type of features.

In [ ]:
def count_numerical_categorical(df, drop_list):
  count = 0
  for key in df.drop(drop_list, axis=1).keys():
    if np.issubdtype(df[key].dtype, np.number):
      count += 1
  cat = len(df.drop(drop_list, axis=1).keys()) - count
  return count, cat
In [ ]:
num, cat = count_numerical_categorical(train_df, ["Id", "SalePrice"])
print("There are {} numerical features, and {} categorical\n".format(num, cat))
There are 36 numerical features, and 43 categorical

Data Exploration

Let's see the distribution of the SalePrice to get perspective on the prices.

In [ ]:
plt.figure(figsize=(16, 8))
sns.histplot(train_df['SalePrice'], kde=True, color="blue")
plt.title('Distribution of Sales Price', fontsize=18)
plt.show()

We can see that SalePrice does not follow a normal distribution, it will have to normalize later.

Numerical Feature Analysis

Let's see the correlation of features.

In [ ]:
plt.figure(figsize=(30, 18))
sns.heatmap(train_df.drop("Id", axis=1).corr(), annot=True, cmap=sns.color_palette("rocket_r", as_cmap=True))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f41559642e8>

Few things are very noticable:

  • GarageYrBlt which represent the year the garage was build is highly correlated (0.82) with YearBuilt which represent the build year of the house. It means that in most of properties in the dataset, the garage was build at the same time as the house. In that case, we can drop the GarageYrBlt feature as we have too many features.
  • 1stFlrSF which indicates the square foot of the 1st floor of a house is highly correlated (0.81) with TotalBsmtSF which represents the square foot of the basement. It means that in most houses the basement has the same size as the upper floor.
  • GarageArea and GarageCars are highly correlated (0.88), we might want to drop one of them as well.
In [ ]:
def drop_features(dataframes, features):
  for df in dataframes:
    df.drop(features, axis=1, inplace=True)
  return dataframes

def replace_feature_vals(dataframes, feature, new_dict):
  for df in dataframes:
    df[feature] = df[feature].replace(new_dict)
  return dataframes

def combine_categorical_by_threshold(dataframes, feature, threshhold):
  for df in dataframes:
    counts = df[feature].value_counts()
    counts = zip(df[feature].value_counts().index, counts)
    for name, val in counts:
      if val > threshhold:
        continue
      df[feature] = df[feature].replace({name : "Other"})
  return dataframes
In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["GarageYrBlt", "TotalBsmtSF", "GarageArea"])

Let's plot the disributes of the numerical features.

In [ ]:
def get_numerical(df, exclude=[]):
  df = df.drop(exclude, axis=1)
  numerical = []
  for key in df.keys():
    if np.issubdtype(df[key].dtype, np.number):
      numerical += [key]
  return numerical
In [ ]:
def plot_numerical(df, features):
  df = df[features]
  cols = 4
  rows =  int(len(df.keys()) / cols) + 1
  fig = make_subplots(cols=cols, rows=rows)
  for i, key in enumerate(features):
    x, y = i//cols + 1, i%cols + 1
    graph = go.Histogram(x=df[key], name=key)
    fig.add_trace(graph, row=x, col=y)
    fig.update_xaxes(title_text=key, row=x, col=y)

  fig.update_layout(height=2000, width=1000, title_text="Distribution of Features")
  fig.show()
In [ ]:
plot_numerical(train_df, get_numerical(train_df, ["Id", "SalePrice"]))

Okay, let's see how each numerical feature affect the price of the house.

In [ ]:
def plot_numerical_by_target(df, target, features):
  df = df[features + [target]]
  cols = 5
  rows =  int(len(features) / cols) + 1
  fig, ax = plt.subplots(rows, cols)
  colors = sns.color_palette("rocket_r", as_cmap=True)
  for i, key in enumerate(features):
    sns.scatterplot(data=df, x=key, y=target, ax=ax[i//cols, i%cols], hue=target, palette=colors)
    ax[i//cols, i%cols].xaxis.set_tick_params(rotation=45)

  for i in range(len(features), rows*cols):   # delete the empty graphs
    fig.delaxes(ax.flatten()[i])

  plt.subplots_adjust(top=1.5, bottom=0, left=0, right=1)
  fig.set_size_inches(20, 20)
  fig.show()
In [ ]:
plot_numerical_by_target(train_df, "SalePrice", get_numerical(train_df, ["Id", "SalePrice"]))

From these graphs we can see few interesting things:

  • MSSubClass seem to have no major impact on the price, and it has low correlation with the price.
  • By looking at LotArea, most of the values between 0-50K square feet, so we can not understand how large lot areas will impact the price.
  • LotArea which represent the size of the whole property does not seem to have impact on the price of the property, unlike we would assume.
  • The overall quality of the house does affect it's price.
  • The overall condition of the house has small impact on the price. And the most expensive houses rated with average condition.
  • Year of building and remodeling has small impact on the price. And they seem to affect the price in simillar ways. It could make sense in which when someone remodeling it's house, it imedietly increase the house's price. We can try to combine them into one feature YearLstCnst (Year of last constructions). Also we can see in the description of the data that the YearRemodAdd is same as construction if there were no construction.
  • BsmtFinSF2 appeare to have most of it's values at zero, it has low correlation with the price.
  • First and second floors sizes seem to have a good impact on the price, also they look simillar to general living area, we will check it further.
  • In terms of amount of fireplaces, it seem that the amount does not a lot important, but if the propery does have it or not seem to be important.
  • Amount of cars could fit in the garage seem to be important on house's price.
  • We do not have enough knowledge on houses with pool.
  • In LowQualFinSF feature most of the values are 0, so we can not understand something about it.
  • Year and month of sold does not seem to have big impact on the price.
In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["MSSubClass", "BsmtFinSF2", "OverallCond", "LowQualFinSF"])

YearBuilt and YearRemodAdd

As said before, we will combined these feature into one feature YearLstCnst.

In [ ]:
for df in [train_df, test_df]:
  df["YearLstCnst"] = df[["YearBuilt", "YearRemodAdd"]].max(axis=1)
train_df, test_df = drop_features([train_df, test_df], ["YearBuilt", "YearRemodAdd"])
display(train_df["YearLstCnst"])
0       2003
1       1976
2       2002
3       1970
4       2000
        ... 
1455    2000
1456    1988
1457    2006
1458    1996
1459    1965
Name: YearLstCnst, Length: 1460, dtype: int64
In [ ]:
sns.scatterplot(data=train_df, x="YearLstCnst", y="SalePrice", hue="SalePrice")
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f414f8dcdd8>

Fireplaces

We will convert it to a binary feature which represent whether a house has fireplce or not.

In [ ]:
for df in [train_df, test_df]:
  df.loc[df["Fireplaces"] > 0, "Fireplaces"] = 1

display(train_df[["Fireplaces"]].describe().transpose())
print("Train uniques: {}, Test uniques: {}".format(train_df["Fireplaces"].nunique(), test_df["Fireplaces"].nunique()))
count mean std min 25% 50% 75% max
Fireplaces 1460.0 0.527397 0.49942 0.0 0.0 1.0 1.0 1.0
Train uniques: 2, Test uniques: 2

1stFlrSF, 2ndFlrSF and GrLivArea

The second floot seem to have some impact on the price, they there is a second floor. The first floor looks pretty familiar with the general living area and all of them have high correlation.
At this point, we will try to drop both of 1stFlrSF and 2ndFlrSF, as GrLivArea seem to disribuite more familiar to a noraml distribute.

In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["1stFlrSF", "2ndFlrSF"])

Basement's Bathrooms

Most of the properties does not have half bathrooms, almost none of them has 2. At this point we will drop this feature.
Almost none of the basements has more than one full bath, so we will convert this feature into binary one.

In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["BsmtHalfBath"])
for df in [train_df, test_df]:
  df.loc[df["BsmtFullBath"] > 0, "BsmtFullBath"] = 1

display(train_df[["BsmtFullBath"]].describe().transpose())
print("Train uniques: {}, Test uniques: {}".format(train_df["BsmtFullBath"].nunique(), test_df["BsmtFullBath"].nunique()))
count mean std min 25% 50% 75% max
BsmtFullBath 1460.0 0.413699 0.492665 0.0 0.0 0.0 1.0 1.0
Train uniques: 2, Test uniques: 2

TotRmsAbvGr and BedroomAbvGr

They both high correlated, TotRmsAbvGr has much higher correlation with the price, and it's distribution seem to be more noraml. We will drop BedroomAbvGr at this point.

In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["BedroomAbvGr"])

KitchenAbvGd

Almost all of the houses has one kitchen above ground, and the houses with two kitchens does not have big impact on the price.

In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["KitchenAbvGr"])

YrSold and MoSold

We can see that the year and month of sold does not correlated with any feature in the data, even not with the price. Also, if we would like to determine with our model a house that have been sold later than the houses we have in our data, or even determine a price inthe future, the year or the month would not help us to predict the right price.

In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["YrSold", "MoSold"])

EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea and MiscVal

All of them does not have a lot of values and they seem to do not have much impact on the price, we will drop them.

In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["EnclosedPorch", "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal"])

Categorical Features Analysis

Now let's count plot the categorical features. And violin plot to see how SalePrice is dependent on each feature.

In [ ]:
def get_categorical(df, exclude=[]):
  df = df.drop(exclude, axis=1)
  numerical = []
  for key in df.keys():
    if not np.issubdtype(df[key].dtype, np.number):
      numerical += [key]
  return numerical
In [ ]:
def plot_categorical(df, target, features):
  df = df[features + [target]]
  cols = 2
  rows =  len(features)
  fig, ax = plt.subplots(rows, cols)
  for i, key in enumerate(features):
    sub_ax = sns.countplot(data=df, x=key, ax=ax[i, 0], palette="Set2")
    for p in sub_ax.patches:
      _x = p.get_x() + p.get_width() / 2
      _y = p.get_y() + p.get_height() / 2
      value = "{:d}".format(p.get_height())
      sub_ax.text(_x, _y, value, ha="center")
    ax[i, 0].xaxis.set_tick_params(rotation=45)
    sns.violinplot(x=key, y=target, data=df, ax=ax[i, 1], palette="Set2")
    ax[i, 1].set_title("{} VS. {}".format(key, target))
    ax[i, 1].xaxis.set_tick_params(rotation=45)

  plt.subplots_adjust(top=2.5, bottom=0, left=0, right=1)
  fig.set_size_inches(20, 100)
  fig.show()
In [ ]:
plot_categorical(train_df, "SalePrice", get_categorical(train_df, ["Id", "SalePrice"]))

We can see that there are a lot of categorical features that almost all values appear withing one value. This could lead to 'white noise', and we might want to drop few of them because over features.

Dropping Features

  • Street: Type of road access to property, only 6 values are different from the overs. At this point, we will drop it.
  • Utilities: Type of utilities available, all values exept of 1 are AllPub.
  • Condition2:Proximity to various conditions, 1445 out of 1460 values are Norm.
  • RoofMatl: Roof material, 1434 of 1460 values are CompShg which is the standart.
  • GarageCond: Looks very simillar to GarageQual and there are less distribution among this feature, at this point we will drop it.
In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["Street", "Utilities", "Condition2", "RoofMatl", "GarageCond"])

Now let's handle over features:

LandSploe

Has 3 optional values, but one value appears a lot more than the two others, also they seem to affect SalePrice pretty much the same way. We can combined these two options and convert this features to a binnary feature to indicate whether the landslope is gentle or not.

In [ ]:
train_df, test_df = replace_feature_vals([train_df, test_df], "LandSlope", {"Gtl":1, "Mod":0, "Sev":0})
display(train_df[["LandSlope"]].describe().transpose())
display(test_df[["LandSlope"]].describe().transpose())

print("Train uniques: {}, Test uniques: {}".format(train_df["LandSlope"].nunique(), test_df["LandSlope"].nunique()))
count mean std min 25% 50% 75% max
LandSlope 1460.0 0.946575 0.224956 0.0 1.0 1.0 1.0 1.0
count mean std min 25% 50% 75% max
LandSlope 1459.0 0.95682 0.203332 0.0 1.0 1.0 1.0 1.0
Train uniques: 2, Test uniques: 2

HouseStyle

It has these values:

  • 1Story - One story
  • 1.5Fin - One and one-half story: 2nd level finished
  • 1.5Unf - One and one-half story: 2nd level unfinished
  • 2Story - Two story
  • 2.5Fin - Two and one-half story: 2nd level finished
  • 2.5Unf - Two and one-half story: 2nd level unfinished
  • SFoyer - Split Foyer
  • SLvl - Split Level

We can see that all the 1 and 1.5 floors distribuite simillar, as well as 2 and 2.5 floors, we will combine each.

In [ ]:
train_df, test_df = replace_feature_vals([train_df, test_df], "HouseStyle", {"1.5Fin":"1Story", "1.5Unf":"1Story", "2.5Fin":"2Story", "2.5Unf":"2Story", "SLvl":"SFoyer"})
display(train_df[["HouseStyle"]].describe().transpose())
display(test_df[["HouseStyle"]].describe().transpose())
count unique top freq
HouseStyle 1460 3 1Story 894
count unique top freq
HouseStyle 1459 3 1Story 910

RoofStyle

Has 6 options, 2 values are common and the other less.
In terms of SalePrice the 4 values Gambrel, Mansard, Flat, and Shed affect SalePrice in simillar way, we will combined them to one option Other.

In [ ]:
train_df, test_df = combine_categorical_by_threshold([train_df, test_df], "RoofStyle", 13)
display(train_df[["RoofStyle"]].describe().transpose())
display(test_df[["RoofStyle"]].describe().transpose())

print("Train uniques: {}, Test uniques: {}".format(train_df["RoofStyle"].nunique(), test_df["RoofStyle"].nunique()))
count unique top freq
RoofStyle 1460 3 Gable 1141
count unique top freq
RoofStyle 1459 3 Gable 1169
Train uniques: 3, Test uniques: 3

Exterior1st

We can see 5 options are not common, we will combine them as Other.

In [ ]:
train_df, test_df = train_df, test_df = combine_categorical_by_threshold([train_df, test_df], "Exterior1st", 10)
display(train_df[["Exterior1st"]].describe().transpose())
display(test_df[["Exterior1st"]].describe().transpose())
count unique top freq
Exterior1st 1460 11 VinylSd 515
count unique top freq
Exterior1st 1458 11 VinylSd 510

Exterior2nd

As well, 6 options have less than 10 values, we will combine them as Other.

In [ ]:
train_df, test_df = combine_categorical_by_threshold([train_df, test_df], "Exterior2nd", 10)
display(train_df[["Exterior2nd"]].describe().transpose())
display(test_df[["Exterior2nd"]].describe().transpose())
count unique top freq
Exterior2nd 1460 11 VinylSd 504
count unique top freq
Exterior2nd 1458 12 VinylSd 510

We can see that the test dataset has some value in Exterior2nd that the train set does not have, let's see which one.

In [ ]:
print("train:", pd.unique(train_df["Exterior2nd"].sort_values()))
print()
print("test:", pd.unique(test_df["Exterior2nd"].sort_values()))
train: ['AsbShng' 'BrkFace' 'CmentBd' 'HdBoard' 'MetalSd' 'Other' 'Plywood'
 'Stucco' 'VinylSd' 'Wd Sdng' 'Wd Shng']

test: ['AsbShng' 'Brk Cmn' 'BrkFace' 'CmentBd' 'HdBoard' 'MetalSd' 'Other'
 'Plywood' 'Stucco' 'VinylSd' 'Wd Sdng' 'Wd Shng' nan]

Brk Cmn is the 'bad' value, we can not determine how it will affect the price of the house, for now, we will replace it with 'Other'.
Later on, we will try to convert it with a value based on houses sold in the same area or houses of the same class.

In [ ]:
train_df, test_df = replace_feature_vals([train_df, test_df], "Exterior2nd", {"Brk Cmn":"Other"})
display(train_df[["Exterior2nd"]].describe().transpose())
display(test_df[["Exterior2nd"]].describe().transpose())
count unique top freq
Exterior2nd 1460 11 VinylSd 504
count unique top freq
Exterior2nd 1458 11 VinylSd 510

ExterCond

Indicates on the exterior material condition, Good and Average conditions appear to affect SalePrice pretty much the same, and Poor has only one appearance, and it appears to be close to the Fair mean.

In [ ]:
train_df, test_df = replace_feature_vals([train_df, test_df], "ExterCond", {"Po":"Fa", "Gd":"TA"})
display(train_df[["ExterCond"]].describe().transpose())
display(test_df[["ExterCond"]].describe().transpose())
count unique top freq
ExterCond 1460 3 TA 1428
count unique top freq
ExterCond 1459 3 TA 1409

BsmtCond

The Poor value appears only twice in the data, it seems to be similar to fair so we will combine them.

In [ ]:
train_df, test_df = replace_feature_vals([train_df, test_df], "BsmtCond", {"Po":"Fa"})
display(train_df[["BsmtCond"]].describe().transpose())
display(test_df[["BsmtCond"]].describe().transpose())
count unique top freq
BsmtCond 1423 3 TA 1311
count unique top freq
BsmtCond 1414 3 TA 1295

Heating and HeatingQC

  • Heating: Type of heating, almost all of the values appear to be GasA which means 'Gas forced warm air furnace', we will drop these features.
  • HeatingQC has one poor value which is similar to fair.
In [ ]:
train_df, test_df = drop_features([train_df, test_df], "Heating")
train_df, test_df = replace_feature_vals([train_df, test_df], "HeatingQC", {"Po":"Fa"})
display(train_df[["HeatingQC"]].describe().transpose())
display(test_df[["HeatingQC"]].describe().transpose())
count unique top freq
HeatingQC 1460 4 Ex 741
count unique top freq
HeatingQC 1459 4 Ex 752

Electrical

There are not a lot of data, and it seems that the price dependent on the fuse box options in a similar way, also the mix option has only one value.
We will convert this feature to a binary one, which means standard or not.

In [ ]:
train_df, test_df = combine_categorical_by_threshold([train_df, test_df], "Electrical", 100)
train_df, test_df = replace_feature_vals([train_df, test_df], "Electrical", {"SBrkr":1, "Other":0})
display(train_df[["Electrical"]].describe().transpose())
display(test_df[["Electrical"]].describe().transpose())
print("Train uniques: {}, Test uniques: {}".format(train_df["Electrical"].nunique(), test_df["Electrical"].nunique()))
count mean std min 25% 50% 75% max
Electrical 1459.0 0.914325 0.27998 0.0 1.0 1.0 1.0 1.0
count mean std min 25% 50% 75% max
Electrical 1459.0 0.916381 0.27691 0.0 1.0 1.0 1.0 1.0
Train uniques: 2, Test uniques: 2

GarageType

The Detached and Carport values appear to be similar, and a carport is an outside garage, which can be considered as detached parking.

In [ ]:
train_df, test_df = replace_feature_vals([train_df, test_df], "GarageType", {"CatPlot":"Detchd"})
display(train_df[["GarageType"]].describe().transpose())
display(test_df[["GarageType"]].describe().transpose())
count unique top freq
GarageType 1379 6 Attchd 870
count unique top freq
GarageType 1383 6 Attchd 853

GarageQual

We will combine Poor with Fair quality, and Excellent with Good.

In [ ]:
train_df, test_df = replace_feature_vals([train_df, test_df], "GarageQual", {"Fa":"Po", "Ex":"Gd"})
display(train_df[["GarageQual"]].describe().transpose())
display(test_df[["GarageQual"]].describe().transpose())
count unique top freq
GarageQual 1379 3 TA 1311
count unique top freq
GarageQual 1381 3 TA 1293

BsmtFinType2

Most of the values appear to be in one column Unfinished, and as we earlier dropped the BsmtFinSF2 feature, we will drop this one as well.

In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["BsmtFinType2"])

PoolQC and MiscFeature

They have a lot of missing values and earlier we dropped PoolArea and MiscVal, so we will drop those too.

In [ ]:
train_df, test_df = drop_features([train_df, test_df], ["PoolQC", "MiscFeature"])

Furthur Analysis

Let's try to see the distribution of Neighborhood.

In [ ]:
data = pd.DataFrame()
data["Neighborhood"] = train_df["Neighborhood"].unique()
data["Houses"] = train_df["Neighborhood"].value_counts().values

fig = px.bar(data, x="Neighborhood", y="Houses" ,color="Neighborhood",title="BarPlot Neighborhood")
fig.show()

Now let's see how the OverallQual depends on Neighborhood.

In [ ]:
plt.figure(figsize=(20,10))
ax = sns.boxplot(data=train_df, x="Neighborhood", y="OverallQual")
ax.xaxis.set_tick_params(rotation=45)

Fill NA

First of all, we will fill the training set, and later on the test set.

In [ ]:
def fill_na_with_str(df, keys):
  for key in keys:
    df[key] = df[key].fillna("NA").astype(str)
  return df

def fill_na_with_median(df, keys):
  for key in keys:
    if np.issubdtype(df[key].dtype, np.number):
      df[key] = df[key].fillna(df[key].median(skipna=True)).astype(int)
  return df

def fill_na_with_random_sample(df, keys, values_to_random):
  for key in keys:
    while df[key].isna().sum() > 0:
      df[key] = df[key].map(lambda x: x if not pd.isna(x) else np.random.choice(values_to_random)).astype(type(values_to_random[0]))
  return df

Filling Training Set NA

In [ ]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 52 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Alley          91 non-null     object 
 5   LotShape       1460 non-null   object 
 6   LandContour    1460 non-null   object 
 7   LotConfig      1460 non-null   object 
 8   LandSlope      1460 non-null   int64  
 9   Neighborhood   1460 non-null   object 
 10  Condition1     1460 non-null   object 
 11  BldgType       1460 non-null   object 
 12  HouseStyle     1460 non-null   object 
 13  OverallQual    1460 non-null   int64  
 14  RoofStyle      1460 non-null   object 
 15  Exterior1st    1460 non-null   object 
 16  Exterior2nd    1460 non-null   object 
 17  MasVnrType     1452 non-null   object 
 18  MasVnrArea     1452 non-null   float64
 19  ExterQual      1460 non-null   object 
 20  ExterCond      1460 non-null   object 
 21  Foundation     1460 non-null   object 
 22  BsmtQual       1423 non-null   object 
 23  BsmtCond       1423 non-null   object 
 24  BsmtExposure   1422 non-null   object 
 25  BsmtFinType1   1423 non-null   object 
 26  BsmtFinSF1     1460 non-null   int64  
 27  BsmtUnfSF      1460 non-null   int64  
 28  HeatingQC      1460 non-null   object 
 29  CentralAir     1460 non-null   object 
 30  Electrical     1459 non-null   float64
 31  GrLivArea      1460 non-null   int64  
 32  BsmtFullBath   1460 non-null   int64  
 33  FullBath       1460 non-null   int64  
 34  HalfBath       1460 non-null   int64  
 35  KitchenQual    1460 non-null   object 
 36  TotRmsAbvGrd   1460 non-null   int64  
 37  Functional     1460 non-null   object 
 38  Fireplaces     1460 non-null   int64  
 39  FireplaceQu    770 non-null    object 
 40  GarageType     1379 non-null   object 
 41  GarageFinish   1379 non-null   object 
 42  GarageCars     1460 non-null   int64  
 43  GarageQual     1379 non-null   object 
 44  PavedDrive     1460 non-null   object 
 45  WoodDeckSF     1460 non-null   int64  
 46  OpenPorchSF    1460 non-null   int64  
 47  Fence          281 non-null    object 
 48  SaleType       1460 non-null   object 
 49  SaleCondition  1460 non-null   object 
 50  SalePrice      1460 non-null   int64  
 51  YearLstCnst    1460 non-null   int64  
dtypes: float64(3), int64(17), object(32)
memory usage: 593.2+ KB

We can see that we have 1460 entries in the dataset, with 50 features (exclude Id and SalePrice).
Also, by observing the data, it looks like there are null values. Let's see which features and how many.

In [ ]:
def print_null_count(df):
  entries_count = len(df) # amount of entries in the dataset
  null_count = 0          # count how many features has null values
  df_null = df.isna().sum().sort_values(ascending=False)
  print("{:10s} \t {:5s} | {:5s}\t| {}".format("Feat. Name", "Count", "%", "Type"))
  for key in df_null.keys():
    if df_null[key] == 0:
      continue
    null_count += 1
    print("{:10s} \t {:5d} | {:.2f}%\t| {}".format(key, df_null[key], df_null[key] * 100 / entries_count, df[key].dtype))
  print("-----------------------------------------")
  print("{} features in total".format(null_count))
In [ ]:
print_null_count(train_df)
Feat. Name 	 Count | %    	| Type
Alley      	  1369 | 93.77%	| object
Fence      	  1179 | 80.75%	| object
FireplaceQu 	   690 | 47.26%	| object
LotFrontage 	   259 | 17.74%	| float64
GarageFinish 	    81 | 5.55%	| object
GarageQual 	    81 | 5.55%	| object
GarageType 	    81 | 5.55%	| object
BsmtExposure 	    38 | 2.60%	| object
BsmtFinType1 	    37 | 2.53%	| object
BsmtCond   	    37 | 2.53%	| object
BsmtQual   	    37 | 2.53%	| object
MasVnrArea 	     8 | 0.55%	| float64
MasVnrType 	     8 | 0.55%	| object
Electrical 	     1 | 0.07%	| float64
-----------------------------------------
14 features in total

There are features in which a null value means that a property is missing this feature, such as Fence, Alley, etc...
All of these missing values can be converted into a string that will indicate the same thing.
These features are:
Alley, Fence, FireplaceQu, GarageQual, GarageFinish, GarageType, BsmtCond, BsmtExposure, BsmtQual, BsmtFinType1.

In [ ]:
fill_na_with_str(train_df, ['Alley', 'Fence', 'FireplaceQu', 'GarageQual', 'GarageFinish', 'GarageType', 'BsmtCond', 'BsmtExposure', 'BsmtQual', 'BsmtFinType1'])

print_null_count(train_df)
Feat. Name 	 Count | %    	| Type
LotFrontage 	   259 | 17.74%	| float64
MasVnrArea 	     8 | 0.55%	| float64
MasVnrType 	     8 | 0.55%	| object
Electrical 	     1 | 0.07%	| float64
-----------------------------------------
4 features in total

MasVnrType and MasVnrArea

Screenshot 2020-12-08 151335

MasVnrType is the type of bricks used to cover the exterior of the house, and MasVnrArea is the total square foot of it.
The types are:

  • BrkCmn - Brick Common
  • BrkFace - Brick Face
  • CBlock - Cinder Block
  • Stone - Stone
  • None - None
    Because there is a None option, it means the missing values do not mean there is no masonry veneer in the property.
In [ ]:
temp_df = train_df.loc[train_df["MasVnrType"].isna()][["MasVnrType", "MasVnrArea"]]
display(temp_df.transpose())
234 529 650 936 973 977 1243 1278
MasVnrType NaN NaN NaN NaN NaN NaN NaN NaN
MasVnrArea NaN NaN NaN NaN NaN NaN NaN NaN

We can see that these missing values are related, all the missing values in one column, are missing in the other.
We will fill first MasVnrType with random samples.

In [ ]:
temp_df = fill_na_with_random_sample(temp_df, ["MasVnrType"], train_df["MasVnrType"].values)
display(temp_df.transpose())
234 529 650 936 973 977 1243 1278
MasVnrType None None Stone None BrkFace None BrkFace None
MasVnrArea NaN NaN NaN NaN NaN NaN NaN NaN

Now after filling MasVnrType, we will fill MasVnrArea with the mean of houses with similar parameters, None will be filled with 0.

In [ ]:
non_null = train_df.loc[~train_df["MasVnrArea"].isna() & train_df["MasVnrArea"] > 0, "MasVnrArea"]

temp_df["MasVnrArea"] = temp_df["MasVnrType"].map(lambda x: 0 if x == "None" else np.random.choice(non_null))

display(temp_df.transpose())

train_df[["MasVnrType", "MasVnrArea"]] = train_df[["MasVnrType", "MasVnrArea"]].fillna(temp_df)
234 529 650 936 973 977 1243 1278
MasVnrType None None Stone None BrkFace None BrkFace None
MasVnrArea 0 0 260 0 200 0 344 0
In [ ]:
print_null_count(train_df)
Feat. Name 	 Count | %    	| Type
LotFrontage 	   259 | 17.74%	| float64
Electrical 	     1 | 0.07%	| float64
-----------------------------------------
2 features in total

LotFrontage

It missing 259 of 1460 values, for now, we will try to fill it with random samples.

In [ ]:
train_df = fill_na_with_random_sample(train_df, ["LotFrontage"], train_df["LotFrontage"])
print_null_count(train_df)
Feat. Name 	 Count | %    	| Type
Electrical 	     1 | 0.07%	| float64
-----------------------------------------
1 features in total

The last missing values is in one sample in Electrical, we will fill it with random sample from the dataset.

In [ ]:
train_df = fill_na_with_random_sample(train_df, ["Electrical"], train_df["Electrical"])
for df in [train_df, test_df]:
  df["Electrical"] = df["Electrical"].astype(int)
print_null_count(train_df)
Feat. Name 	 Count | %    	| Type
-----------------------------------------
0 features in total

No more NA values in the training set!

Filling Testing Set NA

In [ ]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 51 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSZoning       1455 non-null   object 
 2   LotFrontage    1232 non-null   float64
 3   LotArea        1459 non-null   int64  
 4   Alley          107 non-null    object 
 5   LotShape       1459 non-null   object 
 6   LandContour    1459 non-null   object 
 7   LotConfig      1459 non-null   object 
 8   LandSlope      1459 non-null   int64  
 9   Neighborhood   1459 non-null   object 
 10  Condition1     1459 non-null   object 
 11  BldgType       1459 non-null   object 
 12  HouseStyle     1459 non-null   object 
 13  OverallQual    1459 non-null   int64  
 14  RoofStyle      1459 non-null   object 
 15  Exterior1st    1458 non-null   object 
 16  Exterior2nd    1458 non-null   object 
 17  MasVnrType     1443 non-null   object 
 18  MasVnrArea     1444 non-null   float64
 19  ExterQual      1459 non-null   object 
 20  ExterCond      1459 non-null   object 
 21  Foundation     1459 non-null   object 
 22  BsmtQual       1415 non-null   object 
 23  BsmtCond       1414 non-null   object 
 24  BsmtExposure   1415 non-null   object 
 25  BsmtFinType1   1417 non-null   object 
 26  BsmtFinSF1     1458 non-null   float64
 27  BsmtUnfSF      1458 non-null   float64
 28  HeatingQC      1459 non-null   object 
 29  CentralAir     1459 non-null   object 
 30  Electrical     1459 non-null   int64  
 31  GrLivArea      1459 non-null   int64  
 32  BsmtFullBath   1457 non-null   float64
 33  FullBath       1459 non-null   int64  
 34  HalfBath       1459 non-null   int64  
 35  KitchenQual    1458 non-null   object 
 36  TotRmsAbvGrd   1459 non-null   int64  
 37  Functional     1457 non-null   object 
 38  Fireplaces     1459 non-null   int64  
 39  FireplaceQu    729 non-null    object 
 40  GarageType     1383 non-null   object 
 41  GarageFinish   1381 non-null   object 
 42  GarageCars     1458 non-null   float64
 43  GarageQual     1381 non-null   object 
 44  PavedDrive     1459 non-null   object 
 45  WoodDeckSF     1459 non-null   int64  
 46  OpenPorchSF    1459 non-null   int64  
 47  Fence          290 non-null    object 
 48  SaleType       1458 non-null   object 
 49  SaleCondition  1459 non-null   object 
 50  YearLstCnst    1459 non-null   int64  
dtypes: float64(6), int64(13), object(32)
memory usage: 581.4+ KB

We can see that we have 1460 entries in the dataset, with 50 features (exclude Id).
Also, by observing the data, it looks like there are null values. Let's see which features and how many.

In [ ]:
print_null_count(test_df)
Feat. Name 	 Count | %    	| Type
Alley      	  1352 | 92.67%	| object
Fence      	  1169 | 80.12%	| object
FireplaceQu 	   730 | 50.03%	| object
LotFrontage 	   227 | 15.56%	| float64
GarageQual 	    78 | 5.35%	| object
GarageFinish 	    78 | 5.35%	| object
GarageType 	    76 | 5.21%	| object
BsmtCond   	    45 | 3.08%	| object
BsmtQual   	    44 | 3.02%	| object
BsmtExposure 	    44 | 3.02%	| object
BsmtFinType1 	    42 | 2.88%	| object
MasVnrType 	    16 | 1.10%	| object
MasVnrArea 	    15 | 1.03%	| float64
MSZoning   	     4 | 0.27%	| object
Functional 	     2 | 0.14%	| object
BsmtFullBath 	     2 | 0.14%	| float64
GarageCars 	     1 | 0.07%	| float64
Exterior1st 	     1 | 0.07%	| object
SaleType   	     1 | 0.07%	| object
BsmtFinSF1 	     1 | 0.07%	| float64
BsmtUnfSF  	     1 | 0.07%	| float64
Exterior2nd 	     1 | 0.07%	| object
KitchenQual 	     1 | 0.07%	| object
-----------------------------------------
23 features in total
In [ ]:
test_df = fill_na_with_str(test_df, ["Alley", "Fence", "FireplaceQu", "GarageQual", "GarageFinish", "GarageType", "BsmtCond", "BsmtExposure", "BsmtQual", "BsmtFinType1"])
print_null_count(test_df)
Feat. Name 	 Count | %    	| Type
LotFrontage 	   227 | 15.56%	| float64
MasVnrType 	    16 | 1.10%	| object
MasVnrArea 	    15 | 1.03%	| float64
MSZoning   	     4 | 0.27%	| object
BsmtFullBath 	     2 | 0.14%	| float64
Functional 	     2 | 0.14%	| object
GarageCars 	     1 | 0.07%	| float64
SaleType   	     1 | 0.07%	| object
KitchenQual 	     1 | 0.07%	| object
BsmtFinSF1 	     1 | 0.07%	| float64
BsmtUnfSF  	     1 | 0.07%	| float64
Exterior2nd 	     1 | 0.07%	| object
Exterior1st 	     1 | 0.07%	| object
-----------------------------------------
13 features in total
In [ ]:
test_df.loc[test_df["BsmtFullBath"].isna(), ["BsmtCond", "BsmtExposure", "BsmtQual", "BsmtFinType1", "BsmtFinSF1", "BsmtUnfSF"]]
Out[ ]:
BsmtCond BsmtExposure BsmtQual BsmtFinType1 BsmtFinSF1 BsmtUnfSF
660 NA NA NA NA NaN NaN
728 NA NA NA NA 0.0 0.0

We can see that the properties that missing the BsmtFullBath actually does not have a basement.

In [ ]:
for key in ["BsmtFullBath", "BsmtFinSF1", "BsmtUnfSF"]:
  test_df[key] = test_df[key].fillna(0)
print_null_count(test_df)
Feat. Name 	 Count | %    	| Type
LotFrontage 	   227 | 15.56%	| float64
MasVnrType 	    16 | 1.10%	| object
MasVnrArea 	    15 | 1.03%	| float64
MSZoning   	     4 | 0.27%	| object
Functional 	     2 | 0.14%	| object
SaleType   	     1 | 0.07%	| object
GarageCars 	     1 | 0.07%	| float64
Exterior2nd 	     1 | 0.07%	| object
Exterior1st 	     1 | 0.07%	| object
KitchenQual 	     1 | 0.07%	| object
-----------------------------------------
10 features in total

Now we will handle the features missing 1-4 values. For the numerical variables, we will fill with the median, and for the categorical variables, we will fill with random samples. Categorical: MSZoning, Functional, SaleType, Exterior1st, Exterior2nd, KitchenQual.
Numerical: GarageCars.

In [ ]:
for key in ["MSZoning", "Functional", "SaleType", "KitchenQual", "Exterior1st", "Exterior2nd"]:
  test_df = fill_na_with_random_sample(test_df, [key], test_df[key])

fill_na_with_median(test_df, ["GarageCars"])

print_null_count(test_df)
Feat. Name 	 Count | %    	| Type
LotFrontage 	   227 | 15.56%	| float64
MasVnrType 	    16 | 1.10%	| object
MasVnrArea 	    15 | 1.03%	| float64
-----------------------------------------
3 features in total

As LotFrontage was missing as well in the training set, we will fill it with the same way we did in the training set.

In [ ]:
test_df = fill_na_with_random_sample(test_df, ["LotFrontage"], test_df["LotFrontage"])
print_null_count(test_df)
Feat. Name 	 Count | %    	| Type
MasVnrType 	    16 | 1.10%	| object
MasVnrArea 	    15 | 1.03%	| float64
-----------------------------------------
2 features in total

MasVnrType and MasVnrArea

They miss almost the same amount of values, probably in the same entries in the data.

In [ ]:
temp_df = test_df.loc[test_df["MasVnrType"].isna()][["MasVnrType", "MasVnrArea"]]
display(temp_df.transpose())
231 246 422 532 544 581 851 865 880 889 908 1132 1150 1197 1226 1402
MasVnrType NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
MasVnrArea NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 198 NaN NaN NaN

Before handling the NaN values, let's fill the specific one which only missing MasVnrType but not MasVnrArea.
We will fill it with the most common value.

In [ ]:
most_freq_type =  test_df.loc[~(test_df["MasVnrType"].isna()) & (test_df["MasVnrType"] != "None"), "MasVnrType"].mode().iloc[0]

test_df.loc[test_df["MasVnrType"].isna() & test_df["MasVnrArea"] > 0, "MasVnrType"] = most_freq_type

temp_df = test_df.loc[test_df["MasVnrType"].isna()][["MasVnrType", "MasVnrArea"]]
display(temp_df.transpose())
231 246 422 532 544 581 851 865 880 889 908 1132 1197 1226 1402
MasVnrType NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
MasVnrArea NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Now, all the other missing values will be filled in the same way we filled it in the training set.

In [ ]:
temp_df = fill_na_with_random_sample(temp_df, ["MasVnrType"], test_df["MasVnrType"].values)
non_null = test_df.loc[~test_df["MasVnrArea"].isna() & test_df["MasVnrArea"] > 0, "MasVnrArea"]

temp_df["MasVnrArea"] = temp_df["MasVnrType"].map(lambda x: 0 if x == "None" else np.random.choice(non_null))
display(temp_df.transpose())

test_df[["MasVnrType", "MasVnrArea"]] = test_df[["MasVnrType", "MasVnrArea"]].fillna(temp_df)
231 246 422 532 544 581 851 865 880 889 908 1132 1197 1226 1402
MasVnrType BrkFace BrkFace BrkFace BrkFace None BrkFace None None None None None BrkFace None None BrkFace
MasVnrArea 422 170 200 182 0 162 0 0 0 0 0 288 0 0 295
In [ ]:
print_null_count(test_df)
Feat. Name 	 Count | %    	| Type
-----------------------------------------
0 features in total

No more NA values!

Training

In [ ]:
categorical_ft = ["MSZoning", "Alley", "LotShape", "LandContour", "LotConfig", "LandSlope", "Neighborhood", "Condition1", "BldgType",
                  "HouseStyle", "RoofStyle", "Exterior1st", "Exterior2nd", "MasVnrType", "Foundation", "CentralAir", "Electrical",
                  "Functional", "Fireplaces", "GarageType", "GarageFinish", "PavedDrive", "Fence", "SaleType", "SaleCondition"]
numerical_ft = ["LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtUnfSF", "GrLivArea", "WoodDeckSF", "OpenPorchSF"]
ordinal_ft = ["OverallQual", "ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "HeatingQC", "BsmtFullBath",
              "FullBath", "HalfBath", "KitchenQual", "TotRmsAbvGrd", "FireplaceQu", "GarageCars", "GarageQual", "YearLstCnst"]
              
all_ft = categorical_ft + numerical_ft + ordinal_ft
print("Categotical", len(categorical_ft))
print("Ordinal", len(ordinal_ft))
print("Numerical", len(numerical_ft))
print("Total", len(all_ft))
Categotical 25
Ordinal 17
Numerical 8
Total 50

Let's split the training set into X as features, and t as the target.

In [ ]:
# divide the data to features and target
t = train_df['SalePrice'].copy()
X = train_df.drop(['SalePrice', 'Id'], axis=1)
In [ ]:
def find_generator_len(generator, use_pbar=True):
    i = 0
    
    if use_pbar:
        pbar = tqdm(desc='Calculating Length', ncols=1000, bar_format='{desc}{bar:10}{r_bar}')

    for a in generator:
        i += 1

        if use_pbar:
            pbar.update()

    if use_pbar:
        pbar.close()

    return i
In [ ]:
# calculate score and loss from cv (KFold or LPO) and display graphs
def get_cv_score_and_loss(X, t, model, k, show_score_loss_graphs=False, use_pbar=True):
    scores_losses_df = pd.DataFrame(columns=['fold_id', 'split', 'score', 'loss'])

    cv = KFold(n_splits=k, shuffle=True, random_state=1)

    if use_pbar:
        pbar = tqdm(desc='Computing Models', total=find_generator_len(cv.split(X)))

    for i, (train_ids, val_ids) in enumerate(cv.split(X)):
        X_train = X.loc[train_ids]
        t_train = t.loc[train_ids]
        X_val = X.loc[val_ids]
        t_val = t.loc[val_ids]

        model.fit(X_train, t_train)

        y_train = model.predict(X_train)
        y_val = model.predict(X_val)
        scores_losses_df.loc[len(scores_losses_df)] = [i, 'train', model.score(X_train, t_train), mean_squared_error(t_train, y_train, squared=False)]
        scores_losses_df.loc[len(scores_losses_df)] = [i, 'val', model.score(X_val, t_val), mean_squared_error(t_val, y_val, squared=False)]

        if use_pbar:
            pbar.update()

    if use_pbar:
        pbar.close()


    val_scores_losses_df = scores_losses_df[scores_losses_df['split']=='val']
    train_scores_losses_df = scores_losses_df[scores_losses_df['split']=='train']

    mean_val_score = val_scores_losses_df['score'].mean()
    mean_val_loss = val_scores_losses_df['loss'].mean()
    mean_train_score = train_scores_losses_df['score'].mean()
    mean_train_loss = train_scores_losses_df['loss'].mean()

    if show_score_loss_graphs:
        fig = px.line(scores_losses_df, x='fold_id', y='score', color='split', title=f'Mean Val Score: {mean_val_score:.2f}, Mean Train Score: {mean_train_score:.2f}')
        fig.show()
        fig = px.line(scores_losses_df, x='fold_id', y='loss', color='split', title=f'Mean Val Loss: {mean_val_loss:.2f}, Mean Train Loss: {mean_train_loss:.2f}')
        fig.show()

    return mean_val_score, mean_val_loss, mean_train_score, mean_train_loss

Now let's divide the features into seperate dataframes, numericals, categoricals, and ordinals.

In [ ]:
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
all_cols = np.array(X.columns)

print('Numerical Cols:', numerical_cols)
print('Categorical Cols:', categorical_cols)
print('All Cols:', all_cols)
Numerical Cols: Index(['LotFrontage', 'LotArea', 'LandSlope', 'OverallQual', 'MasVnrArea',
       'BsmtFinSF1', 'BsmtUnfSF', 'Electrical', 'GrLivArea', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars',
       'WoodDeckSF', 'OpenPorchSF', 'YearLstCnst'],
      dtype='object')
Categorical Cols: Index(['MSZoning', 'Alley', 'LotShape', 'LandContour', 'LotConfig',
       'Neighborhood', 'Condition1', 'BldgType', 'HouseStyle', 'RoofStyle',
       'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'FireplaceQu',
       'GarageType', 'GarageFinish', 'GarageQual', 'PavedDrive', 'Fence',
       'SaleType', 'SaleCondition'],
      dtype='object')
All Cols: ['MSZoning' 'LotFrontage' 'LotArea' 'Alley' 'LotShape' 'LandContour'
 'LotConfig' 'LandSlope' 'Neighborhood' 'Condition1' 'BldgType'
 'HouseStyle' 'OverallQual' 'RoofStyle' 'Exterior1st' 'Exterior2nd'
 'MasVnrType' 'MasVnrArea' 'ExterQual' 'ExterCond' 'Foundation' 'BsmtQual'
 'BsmtCond' 'BsmtExposure' 'BsmtFinType1' 'BsmtFinSF1' 'BsmtUnfSF'
 'HeatingQC' 'CentralAir' 'Electrical' 'GrLivArea' 'BsmtFullBath'
 'FullBath' 'HalfBath' 'KitchenQual' 'TotRmsAbvGrd' 'Functional'
 'Fireplaces' 'FireplaceQu' 'GarageType' 'GarageFinish' 'GarageCars'
 'GarageQual' 'PavedDrive' 'WoodDeckSF' 'OpenPorchSF' 'Fence' 'SaleType'
 'SaleCondition' 'YearLstCnst']

As we do not have a lot of samples to based the predictions on, we will use K-Fold Cross Validation rather than LPO.

Let's run cross-validation on all of the data while converting the categorical features into OHE, and normalize the numerical features.
As we do not have a lot of samples in our training set, we will use K-Fold cross-validation.
Let's see how the validation looks with 10-K-Fold.

In [ ]:
# use column transformer to insert different transformers for each column

ct = ColumnTransformer([
    ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
    ("standard", StandardScaler(), numerical_cols)])
model_pipe = make_pipeline(ct, SGDRegressor(random_state=1))
val_score, val_loss, train_score, train_loss = get_cv_score_and_loss(X, t, model_pipe, k=10, show_score_loss_graphs=True)
print(f'mean cv val score: {val_score:.2f}\nmean cv val loss {val_loss:.2f}')
print(f'mean cv train score: {train_score:.2f}\nmean cv train loss {train_loss:.2f}')


mean cv val score: 0.82
mean cv val loss 31588.53
mean cv train score: 0.88
mean cv train loss 27278.16

We can see that the training data maintain a stable loss and score over the different folds, as expected because the different training sets in each fold have a lot of samples in common, while the validation set in each fold has completely different samples. Still, the validation looks somewhat stable.

Now let's try to see how the different polynomial degrees will affect the data.
We have a lot of features so let's try with a lower degree at first (5-degree).
We will use cross-validation (10-K-Fold) as well to validate how the polynomial degree affects the model.

In [ ]:
# show graph of score and loss by plynomial degree of numerical features
def show_degree_graphs_cv_train(X, t, model, k, max_degree=10):
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
    categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
    
    val_train_score_loss_df = pd.DataFrame(columns=['degree', 'split', 'score', 'loss'])
    for i in tqdm(range(1, max_degree), desc='Poly Degree'):
        ct_enc_std_poly = ColumnTransformer([
            ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
            ("standard_poly", make_pipeline(PolynomialFeatures(degree=i), StandardScaler()), numerical_cols)])
        model_pipe = make_pipeline(ct_enc_std_poly, model)
        val_score, val_loss, train_score, train_loss = get_cv_score_and_loss(X, t, model_pipe, k=k, show_score_loss_graphs=False, use_pbar=False)
        val_train_score_loss_df.loc[len(val_train_score_loss_df)] = [i, 'train', train_score, train_loss]
        val_train_score_loss_df.loc[len(val_train_score_loss_df)] = [i, 'cv', val_score, val_loss]

    fig = px.line(val_train_score_loss_df, x='degree', y='score', color='split')
    fig.show()
    fig = px.line(val_train_score_loss_df, x='degree', y='loss', color='split')
    fig.show()
    
show_degree_graphs_cv_train(X, t, SGDRegressor(random_state=1), k=10 ,max_degree=5)

We can see that the model does not perform well when we add polynomial degree at any level, we will remain with 1-degree of each feature.

We have a lot of features, let's try to find the most significant feature to affect the model and train with them.
We will validate each model with its features again with 10-K-Fold cross-validation.
We will use the forward feature selection method using RFE to choose each time a greater number of feature that was most significant.

In [ ]:
def select_features(x, t, k=10):
  initial_state = 10
  features = None
  scores = pd.DataFrame(columns=["num", "val_score", "val_loss", "train_score", "train_loss"])
  for i in tqdm(range(initial_state, len(x.keys()) + 1), desc='Feature Selection'):
    selector = RFE(SGDRegressor(random_state=1), n_features_to_select=i).fit(x, t)
    mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=k, show_score_loss_graphs=False, use_pbar=False)
    if mean_val_score < 0:
      break
    if i == initial_state or mean_val_score > scores["val_score"].max():
      features = selector.support_
    scores.loc[i] = [i, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]

  fig = go.Figure()
  fig.add_trace(go.Scatter(x=scores["num"], y=scores["val_score"]))
  fig.update_xaxes(title_text="Number of features selected")
  fig.update_yaxes(title_text="Cross validation score (no. of correct classifications)")
  fig.show()

  return features, scores["val_score"].max()

numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct_enc_std = ColumnTransformer([
            ("encoding", OrdinalEncoder(), categorical_cols),
            ("standard", StandardScaler(), numerical_cols)])

X_encoded = pd.DataFrame(ct_enc_std.fit_transform(X))

best_model_ft, best_score = select_features(X_encoded, t, k=10)
print(X.loc[:, best_model_ft].keys())
print("Number of features: {} with score {}".format(len(X.loc[:, best_model_ft].keys()), best_score))
Index(['MSZoning', 'LotFrontage', 'Alley', 'LandSlope', 'Neighborhood',
       'Condition1', 'OverallQual', 'RoofStyle', 'Exterior1st', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'CentralAir', 'BsmtFullBath', 'FullBath', 'KitchenQual', 'Functional',
       'Fireplaces', 'GarageFinish', 'GarageCars', 'Fence', 'SaleType'],
      dtype='object')
Number of features: 25 with score 0.7716188054994961

We can see that the model perform best with 25 features.

Now let's try to change the hyper-paramtrics of SGD-Regressor.

In [ ]:
def select_lr(x, t, k=10):
  scores = pd.DataFrame(columns=["lr", "val_score", "val_loss", "train_score", "train_loss"])
  init_state = 0.0001
  final_state = 0.1
  step = 0.0001
  
  for i in tqdm(np.arange(init_state, final_state + step, step), desc='Learning Rate'):
    selector = SGDRegressor(random_state=1, eta0=i, learning_rate="constant").fit(x, t)
    mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=k, show_score_loss_graphs=False, use_pbar=False)
    if mean_val_score < 0:
      break
    scores.loc[len(scores)] = [i, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]

  fig = go.Figure()
  fig.add_trace(go.Scatter(x=scores["lr"], y=scores["val_score"]))
  fig.update_xaxes(title_text="Learning Rate")
  fig.update_yaxes(title_text="Cross validation score (no. of correct classifications)")
  fig.show()

  max_val = scores["val_score"].max()
  best_lr = scores[scores["val_score"] == max_val]["lr"].to_numpy()[0]
  return best_lr, max_val

X_best_ft = X.loc[:, best_model_ft]

numerical_cols = X_best_ft.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_best_ft.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct_enc_std = ColumnTransformer([
            ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
            ("standard", StandardScaler(), numerical_cols)])

X_encoded = pd.DataFrame(ct_enc_std.fit_transform(X_best_ft))

best_learnrate, best_score = select_lr(X_encoded, t, k=10)
print("Best learning rate: {} with score {}".format(best_learnrate, best_score))
Best learning rate: 0.0004 with score 0.8056423166041193

We can see that the best results is with learning rate of 0.0004 which is the default of SGDRegressor.
Let's try to play with the max_iter parameter.

In [ ]:
def select_max_iter(x, t, k=10, lr=0.0001):
  scores = pd.DataFrame(columns=["iter", "val_score", "val_loss", "train_score", "train_loss"])
  init_state = 500
  final_state = 10000
  step = 100
  for i in tqdm(np.arange(init_state, final_state + step, step), desc='Max Iterations'):
    selector = SGDRegressor(random_state=1, max_iter=i, eta0=lr, learning_rate="constant").fit(x, t)
    mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=k, show_score_loss_graphs=False, use_pbar=False)
    if mean_val_score < 0:
      break
    scores.loc[len(scores)] = [i, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]

  fig = go.Figure()
  fig.add_trace(go.Scatter(x=scores["iter"], y=scores["val_score"]))
  fig.update_xaxes(title_text="Max iterations")
  fig.update_yaxes(title_text="Cross validation score (no. of correct classifications)")
  fig.show()

  max_val = scores["val_score"].max()
  best_iter = scores[scores["val_score"] == max_val]["iter"][0]
  return best_iter, max_val

  return best_iter

best_max_iter, best_score = select_max_iter(X_encoded, t, k=10, lr=best_learnrate)
print("Best max iteration: {} with score of {}".format(best_max_iter, best_score))

Best max iteration: 500.0 with score of 0.8056423166041193

We get the same score in each model, so we will use the default of SGDRegressor which is 1000.
Now let's try to change alpha which is the regularization term of SGDRegressor.

In [ ]:
def select_alpha(x, t, k=10, lr=0.0001):
  scores = pd.DataFrame(columns=["alpha", "val_score", "val_loss", "train_score", "train_loss"])
  init_state = 0.0001
  final_state = 0.1
  step = 0.0001
  for i in tqdm(np.arange(init_state, final_state + step, step), desc='Alpha'):
    selector = SGDRegressor(random_state=1, alpha=i, eta0=i, learning_rate="constant").fit(x, t)
    mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=k, show_score_loss_graphs=False, use_pbar=False)
    if mean_val_score < 0:
      break
    scores.loc[len(scores)] = [i, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]

  fig = go.Figure()
  fig.add_trace(go.Scatter(x=scores["alpha"], y=scores["val_score"]))
  fig.update_xaxes(title_text="Alpha")
  fig.update_yaxes(title_text="Cross validation score (no. of correct classifications)")
  fig.show()

  max_val = scores["val_score"].max()
  best_alp = scores[scores["val_score"] == max_val]["alpha"].to_numpy()[0]
  return best_alp, max_val

  return best_alp

best_alpha, best_score = select_alpha(X_encoded, t, k=10, lr=best_learnrate)
print("Best alpha: {} with score {}".format(best_alpha, best_score))
Best alpha: 0.0004 with score 0.8056273751873698

We can see that the best score is when the alpha parameter is 0.0004

Final Model

Now that we know which are the best features to use with our model, and what are the best parameters, let's train our model with these features and parameters on the whole training set and then predict the test samples.

In [ ]:
features = X.loc[:, best_model_ft].keys()
print(len(features))
features
25
Out[ ]:
Index(['MSZoning', 'LotFrontage', 'Alley', 'LandSlope', 'Neighborhood',
       'Condition1', 'OverallQual', 'RoofStyle', 'Exterior1st', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'CentralAir', 'BsmtFullBath', 'FullBath', 'KitchenQual', 'Functional',
       'Fireplaces', 'GarageFinish', 'GarageCars', 'Fence', 'SaleType'],
      dtype='object')
In [ ]:
t_train = train_df["SalePrice"].copy()
X_train = train_df.drop(["SalePrice", "Id"], axis=1)
X_train = X_train[features]
display(X_train)
MSZoning LotFrontage Alley LandSlope Neighborhood Condition1 OverallQual RoofStyle Exterior1st MasVnrType MasVnrArea ExterQual BsmtQual BsmtCond BsmtExposure CentralAir BsmtFullBath FullBath KitchenQual Functional Fireplaces GarageFinish GarageCars Fence SaleType
0 RL 65.0 NA 1 CollgCr Norm 7 Gable VinylSd BrkFace 196.0 Gd Gd TA No Y 1 2 Gd Typ 0 RFn 2 NA WD
1 RL 80.0 NA 1 Veenker Feedr 6 Gable MetalSd None 0.0 TA Gd TA Gd Y 0 2 TA Typ 1 RFn 2 NA WD
2 RL 68.0 NA 1 CollgCr Norm 7 Gable VinylSd BrkFace 162.0 Gd Gd TA Mn Y 1 2 Gd Typ 1 RFn 2 NA WD
3 RL 60.0 NA 1 Crawfor Norm 7 Gable Wd Sdng None 0.0 TA TA Gd No Y 1 1 Gd Typ 1 Unf 3 NA WD
4 RL 84.0 NA 1 NoRidge Norm 8 Gable VinylSd BrkFace 350.0 Gd Gd TA Av Y 1 2 Gd Typ 1 RFn 3 NA WD
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 RL 62.0 NA 1 Gilbert Norm 6 Gable VinylSd None 0.0 TA Gd TA No Y 0 2 TA Typ 1 RFn 2 NA WD
1456 RL 85.0 NA 1 NWAmes Norm 6 Gable Plywood Stone 119.0 TA Gd TA No Y 1 2 TA Min1 1 Unf 2 MnPrv WD
1457 RL 66.0 NA 1 Crawfor Norm 7 Gable CemntBd None 0.0 Ex TA Gd No Y 0 2 Gd Typ 1 RFn 1 GdPrv WD
1458 RL 68.0 NA 1 NAmes Norm 5 Hip MetalSd None 0.0 TA TA TA Mn Y 1 1 Gd Typ 0 Unf 1 NA WD
1459 RL 75.0 NA 1 Edwards Norm 5 Gable HdBoard None 0.0 Gd TA TA No Y 1 1 TA Typ 0 Fin 1 NA WD

1460 rows × 25 columns

Now let's train the final model on all the samples in the dataset.

In [ ]:
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_train.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()

ct = ColumnTransformer([
    ("encoding_cat", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
    ("standard", StandardScaler(), numerical_cols)],
    remainder='passthrough')

X_train_encode = ct.fit_transform(X_train)

final_model = SGDRegressor(random_state=1, alpha=best_alpha, eta0=best_learnrate, learning_rate="constant")
In [ ]:
X_test = test_df.drop("Id", axis=1)
X_test = X_test[features]

X_test_encode = ct.fit_transform(X_test.astype(str))

Make sure the train and test data in the same shape.

In [ ]:
print(X_train.shape)
print(X_test.shape)
(1460, 25)
(1459, 25)
In [ ]:
print(X_train_encode.shape)
print(X_test_encode.shape)
(1460, 117)
(1459, 117)
In [ ]:
final_model.fit(X_train_encode, t_train)
y_train = final_model.predict(X_train_encode)

rmse = mean_squared_error(t_train, y_train, squared=False)
print("Final RMSE: {}".format(rmse))
Final RMSE: 32399.6198828841
In [ ]:
y_test = final_model.predict(X_test_encode)
In [ ]:
submission = pd.DataFrame({
        "Id": test_df["Id"],
        "SalePrice": y_test
    })
submission.to_csv('submission.csv', index=False)

Score

comp

best

scores

Conclusions

In this assignment, we had to predict the sale price of houses in Ames, Iowa.
We had about 80 features based on the prediction.
I have tried to minimize the number of features by looking at the correlation between the features and the target (the price), and how they affect each other, if it looked like a feature have no important impact on the target, or it has a high correlation with another feature, I dropped it.
Also, I have tried to look for the best hyper-parametric using 10-Fold Cross-Validation rather than with LPO because we did not have a lot of samples based on the prediction.
During the analysis of the data I have tried to do some steps that turned out to be a mistake:

  • I tried to drop the LotArea feature, but later got it back because I wanted to fill the NA values in the Frontage feature, based on it.
  • I tried to normalize the SalePrice with the log function, as was suggested in some other notebooks, despite we have not learned yet normalize methods. I got really bad predictions based on this method (as can see in the score section), and when I didn't normalize the SalePrice at all the final score was much better.

Also, based on the final score, I think that I did a pretty good analysis work with deciding which feature to drop and transform, despite not using any profiling feature.